Isolation Forest is an unsupervised machine learning algorithm used for anomaly detection (e.g., credit card fraud or money laundering). Isolation forests are similar to random forests in that both use an ensemble of trees. Instead of decision trees, however, isolation forests use what are called isolation trees. Once the isolation forest is built, the anomalies are the points with the shortest average path length. The key intuition is that it is much easier to isolate an outlier than an inlier (a “normal” point).
Isolation trees are data structures with the following property:
Let \(T\) be a node of an isolation tree. Each node is either an external node (a leaf with no children) or an internal node with a “test” and exactly two daughter nodes (\(T_l\), \(T_r\)). On a diagram, \(T_l\) and \(T_r\) represent the left and right daughter nodes, respectively.
\(\textbf{What’s the “test” in an isolation tree?}\)
Given a sample of data \(X = \{x_1, x_2, ..., x_n\}\) with \(n\) observations from a \(d\)-variate distribution, an isolation tree takes an attribute \(q\) of an observation and compares it to a splitting value \(p\). The test is \(q < p\): if \(q < p\), the point is assigned to \(T_l\); otherwise it is assigned to \(T_r\). To build an isolation tree, we recursively divide \(X\) by randomly selecting an attribute \(q \in \{1, ..., d\}\) and a splitting value \(p\) between the minimum and maximum of the \(q\)th attribute over all data points. The test splits the data points into \(T_l\) and \(T_r\), and the process continues recursively until either:
\(\quad\) a) A height limit is reached
\(\quad\) b) |X| = 1 (can’t split data anymore)
\(\quad\) c) Or all data in X have the same value
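The recursive construction above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation; `build_itree` and `path_length` are our own names.

```python
import random

def build_itree(X, height, height_limit):
    """Recursively build an isolation tree from a list of d-dimensional points.

    Splitting stops when the height limit is reached, when at most one
    point remains, or when all remaining points are identical.
    """
    if height >= height_limit or len(X) <= 1 or all(x == X[0] for x in X):
        return {"size": len(X)}              # external node
    d = len(X[0])
    q = random.randrange(d)                  # randomly selected attribute
    lo = min(x[q] for x in X)
    hi = max(x[q] for x in X)
    if lo == hi:                             # attribute is constant here
        return {"size": len(X)}
    p = random.uniform(lo, hi)               # split value between min and max
    left = [x for x in X if x[q] < p]        # test q < p sends points left
    right = [x for x in X if x[q] >= p]
    return {"q": q, "p": p,
            "left": build_itree(left, height + 1, height_limit),
            "right": build_itree(right, height + 1, height_limit)}

def path_length(x, node, height=0):
    """Number of edges x traverses from the root to its external node."""
    if "size" in node:
        return height
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, height + 1)
```

Averaging `path_length` over many such trees gives the \(E(h(x))\) used in the anomaly score below; an outlier far from the bulk of the data tends to reach an external node in fewer splits.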
Once the splitting process is over, we’re left with an isolation tree. We then determine which points are anomalies and which are “normal” by computing each point’s path length and anomaly score.
Path length \(h(x)\): the number of edges a data point \(x\) traverses in an isolation tree from the root node to an external node.
The anomaly score \(s\) of an instance \(x\) is defined as \(s(x, n) = 2^{-E(h(x))/c(n)}\), where \(E(h(x))\) is the average of \(h(x)\) over a collection of isolation trees, and \(c(n) = 2H(n-1) - 2(n-1)/n\) is the average path length of an unsuccessful search in a binary search tree, which normalizes \(E(h(x))\). \(H(i)\) is the \(i\)th harmonic number and can be estimated by \(\ln(i) + 0.5772156649\) (the Euler–Mascheroni constant). More about why \(c(n)\) is used can be found in the references.
Looking at anomaly score we see that:
As \(E(h(x)) \rightarrow c(n), s \rightarrow 0.5\)
As \(E(h(x)) \rightarrow 0, s \rightarrow 1\)
As \(E(h(x)) \rightarrow n-1, s \rightarrow 0\)
Most notably, this implies that the lower the average path length of a point (i.e., more likely to be an anomaly), the higher the anomaly score. Conversely, the higher the average path length (i.e., more likely to be a normal point), the lower the anomaly score.
Using the anomaly score, s, the following assessment is made:
if instances return s very close to 1, then they are definitely anomalies,
if instances have s much smaller than 0.5, then they are quite safe to be regarded as normal instances, and
if all the instances return \(s \approx 0.5\), then the entire sample does not really have any distinct anomaly.
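The scoring formula above can be written out directly. The sketch below is a minimal Python illustration of \(c(n)\) and \(s(x, n)\); `c_factor` and `anomaly_score` are our own names.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler–Mascheroni constant

def c_factor(n):
    """c(n) = 2H(n-1) - 2(n-1)/n, the average path length of an
    unsuccessful search in a binary search tree on n points."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # H(n-1) ~ ln(n-1) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(expected_path_length, n):
    """s(x, n) = 2^(-E(h(x)) / c(n))."""
    return 2.0 ** (-expected_path_length / c_factor(n))
```

For a subsample of \(n = 256\) points, a point whose average path length equals \(c(256)\) scores exactly 0.5, while shorter average paths push the score toward 1, matching the limits listed above.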
Key parameters of sklearn’s IsolationForest:
n_estimators: The number of base estimators (isolation trees) in the ensemble (forest)
max_samples: The number of samples to draw from the dataset to train each base estimator.
contamination: The proportion of outliers in the dataset
bootstrap: If True, individual trees are fit on random subsets of the training data sampled with replacement.
Key parameters of the isolation.forest function in the isotree R package:
ntrees: The number of isolation trees in the isolation forest
sample_size: The number of samples to draw from the dataset to train each isolation tree.
ndim: The number of columns to combine to produce a split
missing_action: How to handle missing data at both fitting and prediction time. Setting it to “fail” assumes there is no missing data and triggers undefined behavior if any is encountered
scoring_metric: Metric to use for determining anomaly scores
prob_pick_pooled_gain: The probability of choosing the split threshold as the one that maximizes the pooled gain criterion rather than at random (with ndim = 1)
Isolation forests are a method of anomaly and outlier detection. We chose a dataset of credit card transactions, since identifying fraudulent transactions is a natural anomaly detection task.
First, we plot the data of the time and amount of credit card transactions. The data are from https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud.
# plot data
ggplot(credit, aes(x = Time, y = Amount)) +
geom_point(shape = 1, alpha = 0.5) +
labs(x = "Time", y = "Transaction Amount") +
labs(alpha = "", color = "Legend")
Then, using the isotree package, we create a new
isolation forest with default parameters and fit it to the data.
# create an iForest with default parameters
model_orig <- isolation.forest(
credit,
ndim = 1,
sample_size = 256,
ntrees = 1000,
missing_action = "fail"
)
# make predictions using the iForest model
pred_orig <- predict(model_orig, credit)
Next, we play around with the parameters to make two other types of iForests, known as a Density iForest and a Fair-Cut iForest.
# density iForest
model_dens <- isolation.forest(
credit,
ndim = 1,
sample_size = 256,
ntrees = 1000,
missing_action = "fail",
scoring_metric = "density"
)
# density iForest predictions
pred_dens <- predict(model_dens, credit)
# fair-cut iForest
model_fcf <- isolation.forest(
credit,
ndim = 1,
sample_size = 32,
prob_pick_pooled_gain = 1,
ntrees = 1000,
missing_action = "fail"
)
# fair-cut iForest predictions
pred_fcf <- predict(model_fcf, credit)
In the case of anomalies, our metric of choice is the area under the ROC curve (AUROC). Accuracy is not a good metric here: since there are few outliers, classifying the entire dataset as normal would still yield a high accuracy rate. The ROC curve, on the other hand, balances sensitivity (the true positive rate) against specificity (the true negative rate), giving us better insight into how well our models worked. We compare the models below, with the default iForest performing the best according to the AUROC metric.
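To see concretely why accuracy misleads on imbalanced data while AUROC does not, here is a quick sketch (in Python, since the point is language-agnostic; the `auroc` helper is our own rank-based implementation, not a library function):

```python
# 1000 transactions, of which only 10 (1%) are fraud
y_true = [1] * 10 + [0] * 990

# A useless "detector" that labels everything normal
pred_labels = [0] * 1000
accuracy = sum(t == p for t, p in zip(y_true, pred_labels)) / len(y_true)
# accuracy is 0.99 even though no fraud case is caught

def auroc(scores, labels):
    """P(random positive outscores random negative); ties count as 1/2."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A constant anomaly score has no ranking skill: AUROC is exactly 0.5
print(auroc([0.0] * 1000, y_true))  # 0.5
```

A model only earns AUROC above 0.5 by actually ranking fraud cases above normal ones, which is exactly what we want an anomaly score to do.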
# store AUC measurements for each model type
results_df <- data.frame(
Model = c(
"Isolation Forest",
"Density Isolation Forest",
"Fair-Cut Forest"
),
AUROC = c(
AUC(pred_orig, credit$Class),
AUC(pred_dens, credit$Class),
AUC(pred_fcf, credit$Class)
)
)
# display
results_df %>%
kable() %>%
kable_styling()
Lastly, we replot the data on a gradient from least likely to be an outlier (yellow) to most likely to be an outlier (purple).
# outlier plot
ggplot(credit, aes(x = Time, y = Amount, color = pred_orig)) +
geom_point(alpha = 0.5) +
labs(x = "Time", y = "Transaction Amount") +
labs(alpha = "", color = "Legend") +
scale_color_gradient(low = "yellow", high = "purple")
We followed the guide in the following link: https://www.kaggle.com/code/norealityshows/outlier-detection-with-isolation-forest-in-r/notebook
The Python implementation uses the IsolationForest class from sklearn and the same credit card dataset as the R implementation.
# imports
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score
This dataset has 31 variables:
Time - time of transaction
V1-V28 - anonymized features obtained via PCA
Amount - amount of money in transaction
Class - 0 or 1, where 0 is not a fraud and 1 is a fraud
# reading data
data = pd.read_csv('/Users/Sofia/Downloads/creditcard.csv')
data.head()
## Time V1 V2 V3 ... V27 V28 Amount Class
## 0 0.0 -1.359807 -0.072781 2.536347 ... 0.133558 -0.021053 149.62 0
## 1 0.0 1.191857 0.266151 0.166480 ... -0.008983 0.014724 2.69 0
## 2 1.0 -1.358354 -1.340163 1.773209 ... -0.055353 -0.059752 378.66 0
## 3 1.0 -0.966272 -0.185226 1.792993 ... 0.062723 0.061458 123.50 0
## 4 2.0 -1.158233 0.877737 1.548718 ... 0.219422 0.215153 69.99 0
##
## [5 rows x 31 columns]
Determine the number of fraud and valid transactions in the entire dataset.
Notice that there are significantly more normal transactions than fraud transactions. The fraud transactions are the anomalies, which is what we are trying to detect.
count_classes = data['Class'].value_counts(sort=True)
count_classes.plot(kind = 'bar', rot=0)
plt.title("Transaction Class Distribution")
plt.xticks(range(2), ["Normal", "Fraud"])
plt.xlabel("Class")
plt.ylabel("Frequency")
Split the transactions into two groups: any observation with class 0 is considered normal, and any observation with class 1 is considered fraud.
#Assigning the transaction class "0 = NORMAL & 1 = FRAUD"
Normal = data[data['Class']==0]
Fraud = data[data['Class']==1]
Partitioning the dataset
X is all columns except the class column
Y is the class column, which will be used as the target
# Take a sample from the data
data1 = data.sample(frac=0.1, random_state=1)
data1.shape
## (28481, 31)
# Get all the columns from the dataframe
columns = data1.columns.tolist()
# Filter the columns to remove data we do not want
columns = [c for c in columns if c not in ["Class"]]
# Store the variable we are predicting
target = "Class"
# Define a random state
state = np.random.RandomState(42)
X = data1[columns]
Y = data1[target]
X_outliers = state.uniform(low=0, high=1, size=(X.shape[0], X.shape[1]))
Now we will define our isolation forest model. The main idea of an isolation forest is to use trees to “isolate” anomalies: an anomaly should require only a few random splits to be separated from the rest of the dataset. This model is sklearn’s IsolationForest. Its main parameters are:
n_estimators : the number of base estimators in ensemble
max_samples: number of samples to draw from X to train each base estimator
contamination: proportion of outliers in dataset
random_state: controls pseudo-randomness of selection of feature and split values
verbose: controls verbosity of tree building process
# isolation forest
Fraud = data1[data1['Class']==1]
Valid = data1[data1['Class']==0]
outlier_fraction = len(Fraud)/float(len(Valid))
iforest = IsolationForest(n_estimators=100, max_samples=len(X), contamination=outlier_fraction, random_state=state, verbose=0)
# Isolation forest model
n_outliers = len(Fraud)
iforest.fit(X)
## IsolationForest(contamination=0.0017234102419808666, max_samples=28481,
##                 random_state=RandomState(MT19937) at 0x7F927E36E240)
scores_prediction = iforest.decision_function(X)
y_pred = iforest.predict(X)
# reshape the prediction to 0 and 1
y_pred[y_pred == 1] = 0
y_pred[y_pred == -1] = 1
n_errors = (y_pred != Y).sum()
# accuracy report
print("Isolation Forest:", n_errors)
## Isolation Forest: 73
print("Accuracy Score: ", accuracy_score(Y, y_pred))
## Accuracy Score: 0.9974368877497279
print("ROC curve: ", roc_auc_score(Y, y_pred))
## ROC curve: 0.632002385929048
print("Classification Report :")
## Classification Report :
print(classification_report(Y, y_pred))
## precision recall f1-score support
##
## 0 1.00 1.00 1.00 28432
## 1 0.26 0.27 0.26 49
##
## accuracy 1.00 28481
## macro avg 0.63 0.63 0.63 28481
## weighted avg 1.00 1.00 1.00 28481
We followed the guide using the following link: https://www.kaggle.com/code/naveengowda16/anomaly-detection-credit-card-fraud-analysis